import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import itertools
import time
import warnings

Chicago PD: Predicting & Inferring Police Misconduct
Team name
0.1 Data quality check / cleaning / preparation
Put code with comments. The comments should explain the code such that it can be easily understood. You may put text (in a markdown cell) before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. Put the name of the person / persons who contributed to each code chunk / set of code chunks. An example is given below.
0.1.1 Data quality check
By Ryan Payne
The code below visualizes the distribution of all the variables in the dataset, and their association with the response.
#...Distribution of continuous variables...#
#...Distribution of categorical variables...#
#...Association of the response with the predictors...#

0.1.2 Data cleaning
By Ryan Payne
From the data quality check we realized that:
- The data did not have a disciplined response column, so we had to construct it from the observations in “final_outcome”.
- The data came in separate csv files. For example, a csv file containing roster information, regional demographics, officer salaries, victim information, and more. We needed to decide which of these csv files to use. We ended up deciding (based on missing data criteria) that we’d focus only on those data pertaining to the officers and not victims. These datasets needed to be merged with one another on the officer’s unique ID number.
- The final_outcome column had double entries: once in lower case and once in upper case, which needed to be combined.
- In general, there were lots of formatting issues that needed to be resolved.
- There were categories with very few (less than 0.01% of) instances that needed to be either removed or consolidated. For example, we consolidated all gang related units in unit_description into one unit category.
The data came in three separate csv files that needed to be merged and cleaned before we could begin our analysis. These three separate files contained information on (1) the accused police officers (their unit, age, etc.), (2) salary information, and (3) other roster information. The cells below walk through how these three files were combined and cleaned to create our final dataset.
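The merge logic can be sketched with toy data frames (the UIDs, columns, and values here are illustrative stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for the three real csv files (values are illustrative)
accused = pd.DataFrame({'UID': [1, 2, 3],
                        'final_outcome': ['Reprimand', 'No Action Taken', '15 day suspension']})
salary = pd.DataFrame({'UID': [1, 2, 3], 'salary': [60000, 72000, 85000]})
roster = pd.DataFrame({'UID': [1, 2, 3], 'officer_age': [34, 45, 29]})

# Merge roster and salary first, then inner-join onto the accused records
# on the officer's unique ID, mirroring the order used below
roster_sal = pd.merge(salary, roster, on='UID')
merged = pd.merge(accused, roster_sal, on='UID', how='inner')
```

An inner join on UID keeps only officers present in all three files, which is the behavior we rely on below.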
0.2 Accused Data
# ACCUSED DATA
accused_df = pd.read_csv('./project_data_unified/complaints/complaints-accused.csv')
# Remove rows for which final_outcome = NaN
accused_df = accused_df.dropna(subset=['final_outcome'])
# Remove rows for which final_outcome = "Unknown" (pretty much same thing as NaN)
accused_df = accused_df.loc[accused_df['final_outcome'] != 'Unknown']
# Rename complaint_category as complaint_description, per reference data
accused_df = accused_df.rename(columns={'complaint_category':'complaint_descr'})
# Drop recommended finding and outcome because (a) >50% NaN and (b) mostly redundant
accused_df = accused_df.drop(columns=['recc_finding', 'recc_outcome'])
# Combine like values
accused_df['final_outcome'] = accused_df['final_outcome'].astype(str)
# Map raw upper-case values to cleaned labels (literal substring replace, so the
# '# DAY SUSPENSION' values handled below are untouched here)
outcome_map = {
    'NO ACTION TAKEN': 'No Action Taken',
    '**PENALTY NOT SERVED': 'Penalty Not Served',
    'REPRIMAND': 'Reprimand',
    'SEPARATION': 'Separation',
    'REINSTATED COURT ACT': 'Reinstated by Court Action',
    'RESIGNED -NOT SERVED': 'Resigned',
    'REINSTATED POLICE BD': 'Reinstated by Police Board',
}
for old, new in outcome_map.items():
    accused_df['final_outcome'] = accused_df['final_outcome'].str.replace(old, new, regex=False)
# treat '15 Day Suspension' same as '15 DAY SUSPENSION', for each #-day combination
import re
accused_df['final_outcome'] = accused_df['final_outcome'].astype(str)
def standardize_case(s):
    # Check if the string contains a number
    if any(char.isdigit() for char in s):
        # Extract the numeric part of the string using regular expressions
        num_str = re.findall(r'\d+', s)[0]
        # Use regular expressions to replace the text part with a lowercased version
        text = re.sub(r'[a-zA-Z]+', lambda m: m.group(0).lower(), s.replace(num_str, ''))
        # Combine the standardized text and numeric parts of the string
        return num_str + ' ' + text.strip()
    else:
        return s
# Update final_outcome
accused_df['final_outcome'] = accused_df['final_outcome'].apply(standardize_case)
Next, we needed to create our response variable (disciplined = 1 or disciplined = 0) from the categories in “final_outcome.” We decided not to treat “Reprimand” as a form of discipline, since it is often used as a token punishment. Importantly for the data, however, reprimands make up the majority of disciplinary actions. Their removal therefore reduces the number of positive responses (disciplined = 1) quite substantially.
# Filter out all the final_outcomes that DON'T count as discipline
disc_outcomes = accused_df['final_outcome'].value_counts().index.difference(['No Action Taken', 'Reprimand', 'RESIGNED -NOT SERVED',
'SUSTAINED-NO PENALTY', 'Violation Noted', 'Penalty Not Served',
'Reinstated by Police Board', 'Resigned'])
# Create the binary, disciplined or not, outcome variable
accused_df['disciplined'] = accused_df['final_outcome'].apply(lambda x: 1 if x in disc_outcomes else 0)

Just as with “disciplined”, we also needed to create our second response variable, sustained (sustained = 1 or sustained = 0), from the categories in “final_finding”.
# Create sustained or not sustained binary (final_find == NS)
accused_df['final_finding'].value_counts()
accused_df['sustained'] = accused_df['final_finding'].apply(lambda x: 1 if x == 'SU' else 0)
# Keep in mind, final_finding has other categories

The accused_df contains a column “complaint_code” which corresponds to a complaint category, such as false arrest and unfit for duty. These description-code pairings were in another file that needed to be merged with accused_df on “complaint_code”.
# Complaint categories
complaint_ids = pd.read_excel('./project_data_unified/Complaint_Categories.xlsx')
complaint_ids = complaint_ids.rename(columns={111:'complaint_code', 'CATEGORY':'complaint_category',
'ON / OFF DUTY':'on_off_duty'})
# Collapse Excessive use of force into use of force
complaint_ids['complaint_category'] = complaint_ids['complaint_category'].apply(lambda x: x.replace('Excessive Force', 'Use of Force'))
# Combine racial profiling and first amendment into 'Other', since so few n
complaint_ids['complaint_category'] = complaint_ids['complaint_category'].apply(lambda x: x.replace('Racial Profiling', 'Other'))
complaint_ids['complaint_category'] = complaint_ids['complaint_category'].apply(lambda x: x.replace('First Amendment', 'Other'))
# Add ON/OFF duty and complaint category to accused_df from reference data
accused_df = pd.merge(accused_df, complaint_ids.loc[:,['complaint_code', 'on_off_duty', 'complaint_category']])
1 Roster
Most of the cleaning performed below consists of renaming, replacing, and consolidating certain values to (a) make the data interpretable both to us and to statsmodels’ functions and (b) remove as much imbalance in particular categories as appropriate.
# Police roster information
roster_df = pd.read_csv('./project_data_unified/roster/roster_1936-2017_2017-04.csv')
star_nums = ['star1', 'star2', 'star3', 'star4', 'star5', 'star6', 'star7', 'star8', 'star9', 'star10']
to_drop = ['last_name_NS','first_name_NS', 'middle_initial', 'middle_initial2', 'suffix_name','roster_1936-2017_2017-04_ID',
'row_id', 'merge', 'birth_year', 'resignation_date', 'link_UID']+star_nums
# Drop columns
roster_df = roster_df.drop(columns=to_drop)
# Combine all police officers into one category
# Identify rows containing 'PO' and replace with 'POLICE OFFICER'
mask = roster_df['current_rank'].str.contains('PO') & ~roster_df['current_rank'].str.contains('POLICE OFFICER') & ~roster_df['current_rank'].str.contains('PO AS DETECTIVE')
roster_df.loc[mask, 'current_rank'] = 'POLICE OFFICER'
# Manually rename remaining "PO" entries
roster_df['current_rank'] = roster_df['current_rank'].apply(lambda x: x.replace('P O ASSGN SEC SPEC', 'POLICE OFFICER'))
roster_df['current_rank'] = roster_df['current_rank'].apply(lambda x: x.replace('P.O. ASSIGNED AS HELICOPTER PILOT', 'POLICE OFFICER'))
roster_df['current_rank'] = roster_df['current_rank'].apply(lambda x: x.replace('DEP CHIEF', 'CHIEF'))
# Collapse the rest of the ranks into Other
to_replace = roster_df['current_rank'].value_counts().index.difference(['POLICE OFFICER', 'PO AS DETECTIVE', 'UNKNOWN', 'COMMANDER',
'DEP CHIEF', 'CHIEF'])
replace_dict = {level: 'Other' for level in to_replace}
roster_df['current_rank'] = roster_df['current_rank'].replace(replace_dict)
# Reduce number of unit descriptions
counts = roster_df['unit_description'].value_counts()
roster_df.loc[roster_df['unit_description'].isin(counts[counts < 50].index), 'unit_description'] = 'Other'
# Rename columns
roster_df = roster_df.rename(columns={'gender':'officer_gender', 'race':'officer_race', 'birth_year':'officer_birth_year',
'current_age':'officer_age'})
# Combine Native American and Pacific Islander, for balance reasons
roster_df['officer_race'] = roster_df['officer_race'].replace('NATIVE AMERICAN/ALASKAN NATIVE','ASIAN/PACIFIC ISLANDER')
roster_df['officer_race'] = roster_df['officer_race'].replace('ASIAN/PACIFIC ISLANDER', 'ASIA_PAC/NATIV_AM')
# Remove NaN and consolidate unknowns
roster_df['unit_description'] = roster_df['unit_description'].fillna('Unknown')
roster_df['unit_description'] = roster_df['unit_description'].apply(lambda x: x.replace('UNKNOWN', 'Unknown'))
# Consolidate traffic units
mask = roster_df['unit_description'].str.contains('TRAFFIC')
roster_df.loc[mask, 'unit_description'] = 'TRAFFIC RELATED'
# Consolidating Airport units
mask = roster_df['unit_description'].str.contains('AIRPORT')
roster_df.loc[mask, 'unit_description'] = 'AIRPORT RELATED'
# Consolidating Public housing units
mask = roster_df['unit_description'].str.contains('PUBLIC HOUSING')
roster_df.loc[mask, 'unit_description'] = 'PUBLIC HOUSING'
# Consolidating YOUTH and juvenile units
mask = (roster_df['unit_description'].str.contains('YOUTH') | roster_df['unit_description'].str.contains('JUVENILE'))
roster_df.loc[mask, 'unit_description'] = 'YOUTH_JUVENILE'
# Consolidate gang units
mask = roster_df['unit_description'].str.contains('GANG')
roster_df.loc[mask, 'unit_description'] = 'GANG ENFORCEMENT_INVESTIGATION'
# Consolidate patrol units
mask = roster_df['unit_description'].str.contains('PATROL')
roster_df.loc[mask, 'unit_description'] = 'PATROL'
# Consolidate forensic units
mask = roster_df['unit_description'].str.contains('FORENSIC')
roster_df.loc[mask, 'unit_description'] = 'FORENSIC'
# Consolidate property units
mask = roster_df['unit_description'].str.contains('PROPERTY') | roster_df['unit_description'].str.contains('ASSET')
roster_df.loc[mask, 'unit_description'] = 'PROPERTY RELATED'
# Consolidate remaining small units again, keeping district 013 separate
counts = roster_df['unit_description'].value_counts()
roster_df.loc[roster_df['unit_description'].isin(counts[counts < 50].index), 'unit_description'] = 'Other'

2 Salary
# Salary data
salary_df = pd.read_csv('./project_data_unified/salary/salary_2002-2017_2017-09.csv')
salary_df['rank'] = salary_df['rank'].astype(str)
# Delete redundant columns
salary_df = salary_df.filter(regex='^(?!salary-)')
salary_df = salary_df.drop(columns=['salary_2002-2017_2017-09_ID.1', 'salary_2002-2017_2017-09_ID', 'org_hire_date', 'year',
'spp_date', 'row_id'])
# Rename
salary_df['rank'] = salary_df['rank'].replace('POLICE OFFICER (ASSIGNED AS DETECTIVE)','PO AS DETECTIVE')
# Collapsing POLICE OFFICERS
mask = salary_df['rank'].str.contains('POLICE OFFICER')
salary_df.loc[mask, 'rank'] = 'POLICE OFFICER'
# Collapsing other POLICE...
mask = salary_df['rank'].str.contains('POLICE') & ~salary_df['rank'].str.contains('POLICE OFFICER (ASSIGNED AS DETECTIVE)', regex=False)
salary_df.loc[mask, 'rank'] = 'POLICE OTHER'
# RENAME (value-level replace: this rank is stored in the 'rank' column, not as a column name)
salary_df['rank'] = salary_df['rank'].replace('POLICE OFFICER (ASSIGNED AS DETECTIVE)', 'PO AS DETECTIVE')
to_replace = salary_df['rank'].value_counts().index.difference(['POLICE OFFICER', 'SERGEANT', 'COMMANDER', 'CHIEF', 'DEPUTY CHIEF',
'LIEUTENANT', 'EXPLOSIVES TECHNICIAN I', 'POLICE OTHER', 'PO AS DETECTIVE'])
replace_dict = {level: 'Other' for level in to_replace}
salary_df['rank'] = salary_df['rank'].replace(replace_dict)
2.1 Merge Dataframes
# Merge roster and salary data
roster_sal_df = pd.merge(salary_df, roster_df, on='UID')
# Merge officer roster-salary info with accused_df
merged_df = pd.merge(accused_df, roster_sal_df, on='UID', how='inner')
# Drop duplicates
merged_df = merged_df.drop_duplicates(subset=['UID','cr_id', 'complaint_code'])
#Drop redundant
merged_df = merged_df.drop(columns='link_UID_y')
# Rename
merged_df = merged_df.rename(columns={'link_UID_x':'link_UID'})

2.2 Final cleaning: dummy variables, filtering unusable variables, etc.
# Create copy of data frame
df = merged_df.copy()
# Ignore regular expression warning messages
warnings.filterwarnings('ignore')
# DUMMIES::
# unit_description dummy df
unit_df = (pd.get_dummies(df['unit_description'])).astype(float)
unit_df = unit_df.drop(columns='Unknown')
# complaint_category dummy df
complaint_df = (pd.get_dummies(df['complaint_category'])).astype(float)
complaint_df = complaint_df.drop(columns='Other')
# officer gender dummy
gender_df = pd.get_dummies(df['officer_gender'])
# drop female (only need 1 dummy to encode binary)
gender_df = gender_df.drop(columns='FEMALE')
# officer Race dummy
race_df = pd.get_dummies(df['officer_race'])
# on/off duty dummy
duty_df = pd.get_dummies(df['on_off_duty'])
# Drop OFF dummy (only need 1 to encode binary)
duty_df = duty_df.drop(columns='OFF')
# current rank of officer dummies
df['current_rank'] = df['current_rank'].astype(str)
current_rank_df = pd.get_dummies(df['current_rank'])
current_rank_df.columns = current_rank_df.columns.str.replace(' ','_')
# drop UNKNOWN column (avoid perfect multicoll)
current_rank_df = current_rank_df.drop(columns='UNKNOWN')
# rename other column so we don't have 2 of them
current_rank_df = current_rank_df.rename(columns={'Other':'Other_rank'})
# Combine complaint_category + unit_description data frames
complaint_unit_df = pd.concat([complaint_df, unit_df], axis=1)
# Combine gender and race
gender_race_df = pd.concat([race_df, gender_df], axis=1)
# gender + race + unit + complaint category
gender_race_unit_complaint_df = pd.concat([unit_df, complaint_df, gender_df, race_df], axis=1)
# Create full data frame
full_df = pd.concat([gender_race_unit_complaint_df, duty_df, current_rank_df], axis=1)
# Add our 2 continuous variables
full_df['salary'] = df['salary']
full_df['officer_age'] = df['officer_age']
# Add disciplined
full_df['y'] = df['disciplined']
# Clean up column names (make compatible with smf.)
full_df.columns = full_df.columns.str.replace(' ', '_')
full_df.columns = full_df.columns.str.replace('/', '_')
full_df.columns = full_df.columns.str.replace(')', '', regex=False)
full_df.columns = full_df.columns.str.replace('(', '_', regex=False)
full_df.columns = full_df.columns.str.replace('-', '_')
full_df.columns = full_df.columns.str.replace('.', '', regex=False)
full_df.columns = full_df.columns.str.replace('&', '_')

2.2.1 Data preparation
By Sankaranarayanan Balasubramanian and Chun-Li
The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:
Since we need to predict whether an officer is disciplined, we derived some new predictors (from existing predictors) that intuitively seemed helpful for predicting the response.
We have shuffled the dataset to prepare it for K-fold cross validation.
We have created a standardized version of the dataset, as we will use it to develop Lasso / Ridge regression models.
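The shuffling and standardization steps can be sketched as below (a pure-pandas sketch on a toy frame; the manual scaling matches what sklearn's StandardScaler computes):

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset
data = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                     'x2': [10.0, 20.0, 30.0, 40.0],
                     'y': [0, 1, 0, 1]})

# Shuffle rows so K-fold splits aren't confounded by the original row ordering
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)

# Standardize predictors to mean 0 and (population) std 1, since Lasso/Ridge
# penalties are only comparable across predictors on a common scale
predictors = ['x1', 'x2']
standardized = shuffled.copy()
standardized[predictors] = (shuffled[predictors] - shuffled[predictors].mean()) / shuffled[predictors].std(ddof=0)
```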
#Inference 4 prep
getsust = pd.read_csv('officers_only.csv', low_memory=False)
cln = pd.read_csv('cleaned_data.csv')
full = pd.concat([cln, getsust[['sustained']]], axis=1)
sust_only = full.loc[full['sustained'] == 1]
sust_only = sust_only.drop(columns = {'sustained'})
sust_only = sust_only.drop(columns = {'Unnamed: 0'})
# Jitter points to make distribution more visible
def jitter(values, j):
    return values + np.random.normal(j, 0.01, values.shape)
# Plot response variable vs salary and officer age, see whether binning is needed
plt.figure(figsize=(14,6))
plt.subplot(1, 2, 1)
ax=sns.scatterplot(df['salary'], jitter(df['disciplined'],0), s=15, edgecolors='white', linewidths=0.2)
plt.xlabel('Salary')
plt.ylabel('Disciplined')
ax.xaxis.set_major_formatter('${x:,.0f}')
plt.subplot(1, 2, 2)
ax=sns.scatterplot(full_df['officer_age'], jitter(full_df['y'],0), s=15, edgecolors='white', linewidths=0.2)
plt.xlabel('Officer Age')
plt.ylabel('Disciplined')
Given the non-uniform distribution of salaries between the two response values, binning salary may be appropriate.
# Bin salary data
def var_transform(data):
    binned_salary = pd.qcut(data['salary'], 4, retbins=True)
    bins = binned_salary[1]
    # include_lowest=True so the minimum salary isn't binned as NaN
    data['salary_binned'] = pd.cut(data['salary'], bins=bins, include_lowest=True)
    dum = pd.get_dummies(data.salary_binned, drop_first=True)
    dum.columns = ['salary'+str(x) for x in range(1, len(bins)-1)]
    data = pd.concat([data, dum], axis=1)
    return data
salary_bin_data = var_transform(full_df)

# Check whether probability of response changes non-monotonically across bins and whether bin number is appropriate
group_salary = salary_bin_data.groupby('salary_binned')['y'].agg([('p_disciplined','mean'),('nobs','count')]).reset_index(drop=False)
plt.plot(group_salary['salary_binned'].astype(str), group_salary['p_disciplined'])
plt.xlabel('Salary $')
plt.ylabel('P(disciplined)')
plt.show()

Binning or quadratically transforming salary appears appropriate.
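We went with bins, but the quadratic alternative would just add a squared salary term to the model. A minimal sketch with illustrative values (centering before squaring, which limits collinearity between the linear and squared terms):

```python
import pandas as pd

# Illustrative salaries standing in for full_df['salary']
salary = pd.Series([48000.0, 72000.0, 90000.0, 110000.0])

# Center before squaring so the squared term is less correlated with the linear one
salary_c = salary - salary.mean()
quad_terms = pd.DataFrame({'salary_c': salary_c, 'salary_c_sq': salary_c ** 2})
```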
# Turn salary bins into dummy variables and concat to main dataframe
salary_D = pd.get_dummies(full_df['salary_binned'])
temp = pd.DataFrame(columns=['salary_1', 'salary_2', 'salary_3', 'salary_4'])
temp['salary_1'] = salary_D.iloc[:,0]
temp['salary_2'] = salary_D.iloc[:,1]
temp['salary_3'] = salary_D.iloc[:,2]
temp['salary_4'] = salary_D.iloc[:,3]
# concat to data frame, drop regular salary and salary binned
full_df = pd.concat([full_df, temp], axis=1)
full_df = full_df.drop(columns=['salary_binned', 'salary'])

######-----------Shuffling the dataset for K-fold------------######
######-----Standardizing the dataset for Lasso / Ridge-------######

2.3 Exploratory data analysis
2.3.1 Inference 1: What variables are most associated with whether an officer is disciplined or not?
Strategy: Start with the full model –> perform backward selection, iteratively removing the predictor with the largest p-value until no remaining predictor has a p-value greater than 0.05 –> check for multicollinearity with VIF.
# All predictors
Xs = list(full_df.columns.difference(['y']))
full_model = smf.logit('y~{}'.format('+'.join(Xs)), data = full_df).fit(maxiter=500, disp=False)
print('Percentage of 96 predictors with p-values greater than 0.05:', (full_model.pvalues[1:] > 0.05).sum() / len(Xs)*100, '%')

Percentage of 96 predictors with p-values greater than 0.05: 84.375 %
# Perform backward selection, iteratively removing the predictor with the largest p-value
# Start with full model
logit_model = full_model
Xs = list(full_df.columns.difference(['y']))
while True:
    # Exclude the intercept's p-value so the intercept is never removed
    pvalues = logit_model.pvalues.drop('Intercept')
    if pvalues.max() > 0.05:
        # remove the predictor with the largest p-value
        predictor_to_remove = pvalues.idxmax()
        Xs.remove(predictor_to_remove)
        logit_model = smf.logit('y~{}'.format('+'.join(Xs)), data=full_df).fit(disp=False)
    else:
        break
final_back_selected_mod = logit_model
print(final_back_selected_mod.summary())

                           Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 194142
Model: Logit Df Residuals: 194098
Method: MLE Df Model: 43
Date: Tue, 14 Mar 2023 Pseudo R-squ.: 0.1168
Time: 14:15:38 Log-Likelihood: -31133.
converged: True LL-Null: -35251.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept -3.7544 0.134 -28.119 0.000 -4.016 -3.493
ASIA_PAC_NATIV_AM -0.4838 0.091 -5.299 0.000 -0.663 -0.305
BUREAU_OF_INTERNAL_AFFAIRS -0.6114 0.170 -3.597 0.000 -0.945 -0.278
Bribery___Official_Corruption 1.9319 0.175 11.067 0.000 1.590 2.274
CENTRAL_DETENTION_UNIT 0.5467 0.109 5.024 0.000 0.333 0.760
CHIEF -1.1168 0.308 -3.628 0.000 -1.720 -0.514
COMMANDER -0.8792 0.210 -4.180 0.000 -1.291 -0.467
Conduct_Unbecoming__Off_duty 1.9287 0.106 18.227 0.000 1.721 2.136
Criminal_Misconduct 1.1719 0.111 10.567 0.000 0.955 1.389
DETECTIVE_AREA___CENTRAL -0.1535 0.073 -2.102 0.036 -0.297 -0.010
DETECTIVE_AREA___SOUTH -0.2004 0.071 -2.823 0.005 -0.340 -0.061
DETECTIVE_SECTION___AREA_4 -0.7295 0.286 -2.553 0.011 -1.290 -0.169
DISTRICT_001 0.1584 0.055 2.868 0.004 0.050 0.267
DISTRICT_002 0.2082 0.049 4.207 0.000 0.111 0.305
DISTRICT_003 -0.1698 0.065 -2.623 0.009 -0.297 -0.043
DISTRICT_010 0.1707 0.073 2.337 0.019 0.028 0.314
DISTRICT_013 0.3508 0.149 2.352 0.019 0.058 0.643
DISTRICT_014 0.2808 0.075 3.756 0.000 0.134 0.427
DISTRICT_018 0.1508 0.056 2.676 0.007 0.040 0.261
DISTRICT_019 0.2526 0.062 4.086 0.000 0.131 0.374
DISTRICT_025 0.2182 0.075 2.927 0.003 0.072 0.364
DISTRICT_REINSTATEMENT_UNIT 0.7166 0.150 4.767 0.000 0.422 1.011
Domestic 1.3951 0.103 13.526 0.000 1.193 1.597
Drug___Alcohol_Abuse 3.3041 0.123 26.953 0.000 3.064 3.544
False_Arrest -1.5978 0.281 -5.695 0.000 -2.148 -1.048
GANG_ENFORCEMENT_INVESTIGATION -0.4031 0.134 -3.007 0.003 -0.666 -0.140
HISPANIC -0.4504 0.037 -12.092 0.000 -0.523 -0.377
Illegal_Search -1.6488 0.150 -11.011 0.000 -1.942 -1.355
Lockup_Procedures 1.5800 0.093 17.049 0.000 1.398 1.762
MALE -0.1351 0.029 -4.675 0.000 -0.192 -0.078
MOBILE_STRIKE_FORCE -0.5607 0.242 -2.319 0.020 -1.035 -0.087
OEMC___DETAIL_SECTION 0.7195 0.165 4.371 0.000 0.397 1.042
ON -0.7329 0.055 -13.256 0.000 -0.841 -0.625
Operation_Personnel_Violations 1.6856 0.085 19.825 0.000 1.519 1.852
PUBLIC_TRANSPORTATION_SECTION 0.1737 0.066 2.639 0.008 0.045 0.303
RECRUIT_TRAINING_SECTION 2.6503 0.229 11.587 0.000 2.202 3.099
Supervisory_Responsibilities 0.9505 0.131 7.264 0.000 0.694 1.207
Traffic 1.2952 0.102 12.714 0.000 1.095 1.495
Use_of_Force 0.4905 0.092 5.338 0.000 0.310 0.671
WHITE -0.6969 0.027 -26.040 0.000 -0.749 -0.644
officer_age 0.0080 0.002 4.564 0.000 0.005 0.011
salary_2 0.2037 0.039 5.286 0.000 0.128 0.279
salary_3 0.3069 0.042 7.342 0.000 0.225 0.389
salary_4 0.3097 0.049 6.342 0.000 0.214 0.405
==================================================================================================
final_back_selected_mod.params.sort_values().tail(5)

Operation_Personnel_Violations    1.685647
Conduct_Unbecoming__Off_duty      1.928682
Bribery___Official_Corruption     1.931911
RECRUIT_TRAINING_SECTION          2.650347
Drug___Alcohol_Abuse              3.304148
dtype: float64
1 - np.exp(-1.597846)

0.7976681281978599

1 - np.exp(-1.648782)

0.807716032077367

np.exp(1.685647)

5.395940977010438
full_df

[Output: preview of full_df — 194142 rows × 97 columns of unit, complaint category, race, gender, duty, and rank dummies plus officer_age, salary bins, and the response y]
len(final_back_selected_mod.params)

44
The back selected model removed 53 predictors from the full model. The next thing to do is to add an additional effect size filter. We are only interested in the predictors that lead to at least a 5% change in the odds ratio of being disciplined.
# Get the p-value back selected model's coefficients and parameter names
p_back_selected_coefs = list(final_back_selected_mod.params[1:])
p_back_selected_params = list(final_back_selected_mod.params.index.difference(['Intercept']))
odds_ratios = np.exp(p_back_selected_coefs)
effect_sizes = abs(1 - odds_ratios)
# Find if any significant predictors were really weak (changed odds ratio by less than 5%)
weak_ind = np.where(effect_sizes < 0.05)
print('Weak predictor to remove:', p_back_selected_params[weak_ind[0][0]])

Weak predictor to remove: officer_age
Finally, let’s check that there aren’t severe multicollinearity problems. By performing p-value back selection, we already likely removed many of the most multicollinear predictors, since predictors with high multicollinearity have large estimated standard errors and therefore large p-values. In fact, this was part of the motivation for back selecting in this manner.
def find_VIF(X): # takes matrix of predictors, calculates VIF for each
    X = sm.add_constant(X)
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    for i in range(len(X.columns)):
        vif_data.loc[i,'VIF'] = variance_inflation_factor(X.values, i)
    return vif_data.sort_values(by='VIF', ascending=False)

df['current_rank'].value_counts()

POLICE OFFICER     169490
PO AS DETECTIVE     21989
COMMANDER            1241
CHIEF                 761
Other                 648
UNKNOWN                13
Name: current_rank, dtype: int64
# Remove officer_age
#p_back_selected_params.remove('officer_age')
# Calculate VIF
vif = find_VIF(full_df[p_back_selected_params])
find_VIF(full_df.drop('y', axis=1))

| feature | VIF |
|---|---|
| const | 22062.261156 |
| POLICE_OFFICER | 2139.120552 |
| PO_AS_DETECTIVE | 1944.288618 |
| WHITE | 841.181123 |
| BLACK | 671.560549 |
| ... | ... |
| DETACHED_SERVICES___GOVERMENT_SECURITY | 4.164397 |
| salary_1 | 3.454936 |
| officer_age | 2.440342 |
| ON | 1.826083 |
| MALE | 1.051903 |

97 rows × 2 columns
plt.bar(vif.iloc[1:,0], vif.iloc[1:,1])
plt.xticks([], [])
plt.xlabel('Predictors')
plt.ylabel('VIF')
plt.show()

print('The 3 predictors with the highest VIF are:', vif.iloc[1,0], ',', vif.iloc[2,0], ',', vif.iloc[3,0])

The 3 predictors with the highest VIF are: Operation_Personnel_Violations , Use_of_Force , Illegal_Search
The predictors only have moderate multicollinearity, indicating that the back selected model is stable and, crucially, that the p-values of the predictors come from precise estimates of their standard errors.
Inference 3 - Sustained or not sustained
df = pd.read_csv('officers_only.csv', low_memory=False)
yes = df[df['sustained'] ==1]
no = df[df['sustained'] == 0].sample(yes.shape[0])
df = pd.concat([no, yes])

sns.pairplot(data=df)
df['salary_binned'] = pd.qcut(df['salary'], q=5)
sns.barplot(data=df, y='sustained', x='salary_binned')
ax = sns.barplot(data=df, y='sustained', x='complaint_category')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize=12, horizontalalignment='right')
plt.show()

ax = sns.barplot(data=df, y='sustained', x='officer_race')
ax.set_xticklabels(ax.get_xticklabels(), rotation=10)
plt.show()

ax = sns.barplot(data=df, y='sustained', x='on_off_duty')

df[df['on_off_duty'] == 'OFF']['complaint_category'].value_counts()

Conduct Unbecoming (Off-duty)    1854
Use of Force                     1802
Drug / Alcohol Abuse              229
Domestic                            1
Name: complaint_category, dtype: int64
df['complaint_category'].value_counts()

Operation/Personnel Violations    12415
Use of Force 5304
Illegal Search 2658
Lockup Procedures 2097
Conduct Unbecoming (Off-duty) 1854
Verbal Abuse 1094
Traffic 1071
Domestic 825
Criminal Misconduct 749
False Arrest 557
Supervisory Responsibilities 498
Drug / Alcohol Abuse 347
Bribery / Official Corruption 111
Other 14
Name: complaint_category, dtype: int64
logit_model = smf.logit(formula = "sustained~complaint_category+salary_binned+unit_description+officer_race+on_off_duty+officer_age+I(officer_age**2)", data = df).fit()

This is not the final prediction model, but it was the formula used to determine the most important variables associated with whether a complaint was sustained or not. Complaint category was easily the most significant. As for transformations, a quadratic transformation of officer age was appropriate based on the shape of the graph, and binned salary was another variable that affected the outcome.
Inference 4
getsust = pd.read_csv('officers_only.csv')
df = pd.read_csv('cleaned_data.csv')
full_df = pd.concat([df, getsust[['sustained']]], axis=1)
sust_only = full_df.loc[full_df['sustained'] == 1]
sust_only = sust_only.drop(columns=['sustained'])
sust_only = sust_only.drop(columns=['Unnamed: 0'])
/var/folders/4b/sl3y8zds5g52xl7kl2dsn52w0000gn/T/ipykernel_20644/75206399.py:1: DtypeWarning: Columns (24) have mixed types. Specify dtype option on import or set low_memory=False.
getsust = pd.read_csv('officers_only.csv')
# Dataframe showing the correlation of each column with the target variable
y_df = pd.DataFrame(sust_only.corr()["y"]).sort_values(by = 'y', ascending = False)
# Lasso Regression
q = list(sust_only.columns)
q.remove('y')
G = sust_only[q]
alphas = 10**np.linspace(0,-4,200)*0.5
#Define scaler
scaler = StandardScaler()
scaler.fit(G)
Gstd = scaler.transform(G)
targ = sust_only.y
lassocv = LassoCV(alphas = alphas, cv = 10, max_iter = 100000)
lassocv.fit(Gstd, targ)
lassocv.alpha_
0.0029363933065947416
lasso = Lasso(alpha = lassocv.alpha_)
lasso.fit(Gstd, targ)
# create a dataframe with the coefficients from the lasso regression
coefdf = pd.concat([pd.Series(G.columns), pd.Series(lasso.coef_)], axis=1).sort_values(by=1, ascending=True)
def vif(X):
    X = add_constant(X)
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    for i in range(len(X.columns)):
        vif_data.loc[i, 'VIF'] = variance_inflation_factor(X.values, i)
    return vif_data
# use VIF to determine and remove multicollinear predictors
l = coefdf.iloc[:, 0].tolist()
lass = sust_only[l]
# keep only features with a VIF of 5 or less
viflassdf = vif(lass)
new = viflassdf[viflassdf['VIF'] <= 5]
3 Developing the prediction model
3.0.1 Code fitting the prediction model (Alan Senkus)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import precision_score, recall_score
import time
import warnings
The commented-out line(s) were used to fit the model to the subset of the data with sustained allegations.
data_cleaned = pd.read_csv('cleaned_data_w_uids.csv', low_memory=False, index_col=[0])
data_officers_fulldf = pd.read_csv('officers_only_fulldf.csv', low_memory=False, index_col=[0])
data_cleaned = data_cleaned.rename(columns={'y': 'disciplined'})
# data_cleaned = data_cleaned[data_cleaned['sustained'] == 1]
data_cleaned.index = data_officers_fulldf.index
# merge, but if the columns are the same, keep the ones from the clean data
merged = data_officers_fulldf.merge(data_cleaned, how='outer', left_index=True, right_index=True, validate='1:1', suffixes=('_off', ''))
merged = merged.loc[:, ~merged.columns.str.endswith('_off')]
merged.drop(['UID'], axis=1, inplace=True)
# merged = merged[merged['sustained'] == 1]
data_cleaned = merged
Some early leftover EDA.
# data_cleaned.corr()['sustained'].sort_values(ascending=False)
# Sustained shouldn't be a predictor. We want to look at all allegations.
data_sustained = data_cleaned['sustained']
data_cleaned.drop('sustained', axis=1, inplace=True)
# What percentage of complaints are disciplined?
data_cleaned['disciplined'].value_counts(normalize=True)
0    0.955636
1 0.044364
Name: disciplined, dtype: float64
Noise code credit to Alex
# Add some noise to salary
data_cleaned['salary'] = (data_cleaned['salary'] + np.random.normal(0,5,data_cleaned.shape[0]))
# Add some noise to age
data_cleaned['officer_age'] = data_cleaned['officer_age'] + np.random.normal(0, 1.5, data_cleaned.shape[0])
This was leftover from quick prototyping, as VIF analysis took a long time.
# Only include columns with correlation > 0.005 with disciplined (quick and dirty, but there's a LOT of columns)
# data_cleaned = data_cleaned[data_cleaned.columns[data_cleaned.corr()['disciplined'].abs() > 0.005]]
# Split the data into features (X) and target (y)
X = data_cleaned.drop('disciplined', axis=1)
y = data_cleaned['disciplined']
# add intercept
# X = sm.add_constant(X)
# Remove all perfect correlations with other variables
# X = X.loc[:, ~X.columns.duplicated()]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# combine X_train_resampled and y_train_resampled into a dataframe
# X_train_resampled = pd.DataFrame(X_train_resampled, columns=X_train.columns)
We eliminate columns that don't change at all.
# Drop constant columns (they make the design matrix singular)
sing_matrix = []
for col in X_train.columns:
    if X_train[col].nunique() == 1:
        sing_matrix.append(col)
X_train.drop(sing_matrix, axis=1, inplace=True)
X_test.drop(sing_matrix, axis=1, inplace=True)
As outlined in the Final Report, we chose 40 for our VIF threshold to give it a little more wiggle room. Our data is very sparse and from the real world, so we wanted to give ourselves a chance.
# Calculate VIF for each feature
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['features'] = X_train.columns
# Remove features from X_train and X_test with VIF > 40
vif = vif[vif['VIF Factor'] > 40]
X_train.drop(vif['features'], axis=1, inplace=True)
X_test.drop(vif['features'], axis=1, inplace=True)
X_test = sm.add_constant(X_test)
/opt/anaconda3/lib/python3.9/site-packages/statsmodels/stats/outliers_influence.py:195: RuntimeWarning: divide by zero encountered in double_scalars
vif = 1. / (1. - r_squared_i)
SMOTE was a suggestion by Prof. Krish after our presentation – thank you! We used it to balance the appearance of the two classes in our data. We experimented with different ratios but did not see a significant difference in the results.
# Apply SMOTE to the training set
smote = SMOTE(sampling_strategy=1, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
Forward selection code, adapted from the class notes. Here, we minimize the AIC.
Part of the reason we chose AIC was because of this: “The Bayesian Information Criterion (BIC) is more useful in selecting a correct model while the AIC is more appropriate in finding the best model for predicting future observations.” (https://www.sciencedirect.com/science/article/pii/B9780444518620500186)
#Function to develop a model based on all predictors in predictor_subset
def processSubset(predictor_subset):
    # Fit a logit model on predictor_subset and record its AIC
    # (always include an intercept)
    with_intercept = pd.DataFrame(sm.add_constant(X_train_resampled[predictor_subset]))
    model = sm.Logit(y_train_resampled, with_intercept).fit(disp=False)
    score = model.aic
    return {"model": model, "score": score}
#Function to find the best predictor out of p-k predictors and add it to the model containing the k predictors
def forward(pred):
    # Pull out predictors we still need to process
    remaining_predictors = [p for p in X_train_resampled.columns if p not in pred]
    tic = time.time()
    results = []
    for p in remaining_predictors:
        results.append(processSubset(pred + [p]))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the lowest AIC
    best_model = models.loc[models['score'].argmin()]
    toc = time.time()
    print("Processed", models.shape[0], "models on", len(pred) + 1, "predictors in", (toc - tic), "seconds.")
    # Return the best model, along with some other useful information about the model
    return best_model
def forward_selection():
    models_best = pd.DataFrame(columns=["score", "model"])
    tic = time.time()
    predictors = []
    for i in range(1, len(X_train_resampled.columns) + 1):
        models_best.loc[i] = forward(predictors)
        predictors = list(models_best.loc[i]["model"].params.index[1:])
    toc = time.time()
    print("Total elapsed time:", (toc - tic), "seconds.")
    return models_best
models_best = forward_selection()
Processed 30 models on 1 predictors in 1.5637402534484863 seconds.
Processed 29 models on 2 predictors in 2.0958340167999268 seconds.
Processed 28 models on 3 predictors in 2.447935104370117 seconds.
Processed 27 models on 4 predictors in 2.4803669452667236 seconds.
Processed 26 models on 5 predictors in 2.8363490104675293 seconds.
Processed 25 models on 6 predictors in 3.134242057800293 seconds.
Processed 24 models on 7 predictors in 3.307857036590576 seconds.
Processed 23 models on 8 predictors in 3.3351292610168457 seconds.
Processed 22 models on 9 predictors in 3.304680824279785 seconds.
Processed 21 models on 10 predictors in 3.379564046859741 seconds.
Processed 20 models on 11 predictors in 3.352635145187378 seconds.
Processed 19 models on 12 predictors in 4.098873853683472 seconds.
Processed 18 models on 13 predictors in 3.4987878799438477 seconds.
Processed 17 models on 14 predictors in 3.57153582572937 seconds.
Processed 16 models on 15 predictors in 3.7793631553649902 seconds.
Processed 15 models on 16 predictors in 3.4337270259857178 seconds.
Processed 14 models on 17 predictors in 3.1700479984283447 seconds.
Processed 13 models on 18 predictors in 3.2971270084381104 seconds.
Processed 12 models on 19 predictors in 2.9969730377197266 seconds.
Processed 11 models on 20 predictors in 3.052708148956299 seconds.
Processed 10 models on 21 predictors in 3.0324201583862305 seconds.
Processed 9 models on 22 predictors in 2.602311134338379 seconds.
Processed 8 models on 23 predictors in 2.3304378986358643 seconds.
Processed 7 models on 24 predictors in 2.365370750427246 seconds.
Processed 6 models on 25 predictors in 2.0494720935821533 seconds.
Processed 5 models on 26 predictors in 1.7375271320343018 seconds.
Processed 4 models on 27 predictors in 1.4599499702453613 seconds.
Processed 3 models on 28 predictors in 1.0515141487121582 seconds.
Processed 2 models on 29 predictors in 0.8562498092651367 seconds.
Processed 1 models on 30 predictors in 0.5110898017883301 seconds.
Total elapsed time: 80.5423309803009 seconds.
models_best['score'] = models_best['score'].astype(float)
models_best.sort_values(by='score', ascending=True, inplace=True)
best_model = models_best.iloc[0]['model']
best_model_params = best_model.params.index.tolist()
# X_train = X_train[best_model_params]
X_test_best = X_test[best_model_params]
best_model.summary()
| Dep. Variable: | disciplined | No. Observations: | 296786 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 296758 |
| Method: | MLE | Df Model: | 27 |
| Date: | Wed, 15 Mar 2023 | Pseudo R-squ.: | 0.03464 |
| Time: | 23:39:04 | Log-Likelihood: | -1.9859e+05 |
| converged: | True | LL-Null: | -2.0572e+05 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 1.0667 | 0.014 | 78.533 | 0.000 | 1.040 | 1.093 |
| ON | -0.8556 | 0.011 | -76.775 | 0.000 | -0.877 | -0.834 |
| Drug___Alcohol_Abuse | 2.0675 | 0.056 | 36.742 | 0.000 | 1.957 | 2.178 |
| MALE | -0.4324 | 0.010 | -42.859 | 0.000 | -0.452 | -0.413 |
| RECRUIT_TRAINING_SECTION | 2.3656 | 0.152 | 15.588 | 0.000 | 2.068 | 2.663 |
| DISTRICT_REINSTATEMENT_UNIT | 1.2144 | 0.067 | 18.204 | 0.000 | 1.084 | 1.345 |
| OEMC___DETAIL_SECTION | 1.4279 | 0.084 | 17.079 | 0.000 | 1.264 | 1.592 |
| DETECTIVE_SECTION___AREA_4 | -1.5707 | 0.110 | -14.258 | 0.000 | -1.787 | -1.355 |
| Bribery___Official_Corruption | 0.9190 | 0.061 | 15.154 | 0.000 | 0.800 | 1.038 |
| CENTRAL_INVESTIGATIONS_DIVISION | -1.1213 | 0.095 | -11.755 | 0.000 | -1.308 | -0.934 |
| Supervisory_Responsibilities | -0.3038 | 0.033 | -9.302 | 0.000 | -0.368 | -0.240 |
| MOUNTED_UNIT | -0.8688 | 0.101 | -8.572 | 0.000 | -1.067 | -0.670 |
| BUREAU_OF_ORGANIZED_CRIME | -2.2350 | 0.365 | -6.123 | 0.000 | -2.950 | -1.520 |
| OFFICE_OF_THE_FIRST_DEPUTY_SUPERINTENDENT | -0.7498 | 0.115 | -6.518 | 0.000 | -0.975 | -0.524 |
| MARINE_OPERATIONS_UNIT | -0.4290 | 0.073 | -5.917 | 0.000 | -0.571 | -0.287 |
| HUMAN_RESOURCES_DIVISION | -0.5299 | 0.103 | -5.159 | 0.000 | -0.731 | -0.329 |
| DETACHED_SERVICES___MISCELLANEOUS_DETAIL | 0.5813 | 0.117 | 4.976 | 0.000 | 0.352 | 0.810 |
| BOMB_AND_ARSON_DIVISION | 0.2911 | 0.074 | 3.937 | 0.000 | 0.146 | 0.436 |
| MAJOR_ACCIDENT_INVESTIGATION_UNIT | -0.3106 | 0.080 | -3.860 | 0.000 | -0.468 | -0.153 |
| PREVEN___NEIGH_DIV | 0.5266 | 0.166 | 3.169 | 0.002 | 0.201 | 0.852 |
| POLICE_DOCUMENTS_SECTION | -0.6116 | 0.200 | -3.051 | 0.002 | -1.004 | -0.219 |
| SPECIAL_INVESTIGATIONS_UNIT | -0.2175 | 0.083 | -2.626 | 0.009 | -0.380 | -0.055 |
| FIELD_SERVICES_SECTION | -0.2754 | 0.108 | -2.558 | 0.011 | -0.486 | -0.064 |
| INTELLIGENCE_SECTION | -0.6550 | 0.281 | -2.335 | 0.020 | -1.205 | -0.105 |
| INSPECTION_DIVISION | 0.2306 | 0.115 | 2.007 | 0.045 | 0.005 | 0.456 |
| DET_DIV_ADMIN | -0.3208 | 0.183 | -1.753 | 0.080 | -0.679 | 0.038 |
| ADMIN_SCHOOL_SECURIT | -0.2834 | 0.189 | -1.497 | 0.134 | -0.654 | 0.088 |
| PUBLIC_HOUSING | -0.1175 | 0.079 | -1.479 | 0.139 | -0.273 | 0.038 |
We also compared it to the full model (all predictors available in forward selection) and a base, intercept-only logit model.
# full logit model (all predictors available in forward selection)
full_logit = sm.Logit(y_train_resampled, sm.add_constant(X_train_resampled)).fit(disp=False)
full_logit.summary()
| Dep. Variable: | disciplined | No. Observations: | 296786 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 296755 |
| Method: | MLE | Df Model: | 30 |
| Date: | Wed, 15 Mar 2023 | Pseudo R-squ.: | 0.03465 |
| Time: | 23:39:05 | Log-Likelihood: | -1.9859e+05 |
| converged: | True | LL-Null: | -2.0572e+05 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 1.0670 | 0.014 | 78.525 | 0.000 | 1.040 | 1.094 |
| ADMIN_SCHOOL_SECURIT | -0.2838 | 0.189 | -1.499 | 0.134 | -0.655 | 0.087 |
| BOMB_AND_ARSON_DIVISION | 0.3006 | 0.075 | 4.022 | 0.000 | 0.154 | 0.447 |
| BUREAU_OF_ORGANIZED_CRIME | -2.2355 | 0.365 | -6.124 | 0.000 | -2.951 | -1.520 |
| CENTRAL_INVESTIGATIONS_DIVISION | -1.1217 | 0.095 | -11.759 | 0.000 | -1.309 | -0.935 |
| DET_DIV_ADMIN | -0.3213 | 0.183 | -1.756 | 0.079 | -0.680 | 0.037 |
| DETACHED_SERVICES___GOVERMENT_SECURITY | 0.0151 | 0.232 | 0.065 | 0.948 | -0.440 | 0.470 |
| DETACHED_SERVICES___MISCELLANEOUS_DETAIL | 0.5808 | 0.117 | 4.973 | 0.000 | 0.352 | 0.810 |
| DETECTIVE_SECTION___AREA_4 | -1.5711 | 0.110 | -14.262 | 0.000 | -1.787 | -1.355 |
| DISTRICT_REINSTATEMENT_UNIT | 1.2140 | 0.067 | 18.198 | 0.000 | 1.083 | 1.345 |
| FIELD_SERVICES_SECTION | -0.2758 | 0.108 | -2.562 | 0.010 | -0.487 | -0.065 |
| HUMAN_RESOURCES_DIVISION | -0.5298 | 0.103 | -5.157 | 0.000 | -0.731 | -0.328 |
| INSPECTION_DIVISION | 0.2301 | 0.115 | 2.003 | 0.045 | 0.005 | 0.455 |
| INTELLIGENCE_SECTION | -0.6555 | 0.281 | -2.336 | 0.019 | -1.205 | -0.106 |
| MAJOR_ACCIDENT_INVESTIGATION_UNIT | -0.3110 | 0.080 | -3.865 | 0.000 | -0.469 | -0.153 |
| MARINE_OPERATIONS_UNIT | -0.4294 | 0.073 | -5.922 | 0.000 | -0.572 | -0.287 |
| MOUNTED_UNIT | -0.8692 | 0.101 | -8.577 | 0.000 | -1.068 | -0.671 |
| OEMC___DETAIL_SECTION | 1.4275 | 0.084 | 17.074 | 0.000 | 1.264 | 1.591 |
| OFFICE_OF_THE_FIRST_DEPUTY_SUPERINTENDENT | -0.7311 | 0.117 | -6.248 | 0.000 | -0.960 | -0.502 |
| PATROL | -0.0938 | 0.071 | -1.326 | 0.185 | -0.233 | 0.045 |
| POLICE_DOCUMENTS_SECTION | -0.6121 | 0.200 | -3.054 | 0.002 | -1.005 | -0.219 |
| PREVEN___NEIGH_DIV | 0.5261 | 0.166 | 3.166 | 0.002 | 0.200 | 0.852 |
| PUBLIC_HOUSING | -0.1179 | 0.079 | -1.485 | 0.138 | -0.274 | 0.038 |
| RECRUIT_TRAINING_SECTION | 2.3652 | 0.152 | 15.585 | 0.000 | 2.068 | 2.663 |
| SPECIAL_INVESTIGATIONS_UNIT | -0.2180 | 0.083 | -2.631 | 0.009 | -0.380 | -0.056 |
| Bribery___Official_Corruption | 0.9187 | 0.061 | 15.149 | 0.000 | 0.800 | 1.038 |
| Drug___Alcohol_Abuse | 2.0672 | 0.056 | 36.736 | 0.000 | 1.957 | 2.178 |
| Supervisory_Responsibilities | -0.3025 | 0.033 | -9.258 | 0.000 | -0.367 | -0.238 |
| MALE | -0.4323 | 0.010 | -42.831 | 0.000 | -0.452 | -0.413 |
| ON | -0.8555 | 0.011 | -76.749 | 0.000 | -0.877 | -0.834 |
| Other_rank | -0.0637 | 0.071 | -0.897 | 0.370 | -0.203 | 0.075 |
# base logit model (just the intercept)
base_logit = sm.Logit(y_train_resampled, sm.add_constant(pd.DataFrame(np.ones(y_train_resampled.shape[0])))).fit(disp=False)
base_logit.summary()
| Dep. Variable: | disciplined | No. Observations: | 296786 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 296785 |
| Method: | MLE | Df Model: | 0 |
| Date: | Wed, 15 Mar 2023 | Pseudo R-squ.: | 0.000 |
| Time: | 23:39:05 | Log-Likelihood: | -2.0572e+05 |
| converged: | True | LL-Null: | -2.0572e+05 |
| Covariance Type: | nonrobust | LLR p-value: | nan |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.004 | 0 | 1.000 | -0.007 | 0.007 |
Confusion matrix is adapted from the class notes.
threshold = 0.5
def plot_confusion_matrix(y_test, y_pred, name):
    # Three-panel confusion matrix plot: raw counts, proportion of all
    # observations, and proportion within each actual (row) class
    plt.figure(figsize=(30, 10))
    plt.suptitle(name, fontsize=16, fontweight='bold')
    plt.subplot(1, 3, 1)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, cmap='Blues')
    plt.title('Based on total observations (Num)')
    plt.xlabel('Predicted (Num)')
    plt.ylabel('Actual (Num)')
    plt.subplot(1, 3, 2)
    # divide every cell by the total number of observations
    cm2 = cm.astype('float') / cm.sum()
    sns.heatmap(cm2, annot=True, cmap='Reds')
    plt.title('Based on proportion of total observations')
    plt.xlabel('Predicted %')
    plt.ylabel('Actual %')
    plt.subplot(1, 3, 3)
    # divide each row by its row total
    cm3 = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    sns.heatmap(cm3, annot=True, cmap='Greens')
    plt.title('Based on proportion of actual (row) observations')
    plt.xlabel('Predicted %')
    plt.ylabel('Actual %')
    # display precision and recall in text
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    plt.text(0.1, -0.1, 'Precision: ' + str(prec), fontsize=20)
    plt.text(0.1, -0.2, 'Recall: ' + str(rec), fontsize=20)
    print('Precision of ' + name + ': ', prec)
    print('Recall of ' + name + ': ', rec)

def test_and_graph_model(model, x_test, y_test, name="Model"):
    # Predict probabilities, binarize at the current threshold, report accuracy
    y_pred = model.predict(x_test)
    y_pred = np.where(y_pred > threshold, 1, 0)
    print('Accuracy of the model: ', (y_pred == y_test).mean())
    plot_confusion_matrix(y_test, y_pred, name)
Here, we choose a minimum acceptable precision (0.05) and find the best recall we can get at that precision.
This strategy was adapted from: https://towardsdatascience.com/calculating-and-setting-thresholds-to-optimise-logistic-regression-performance-c77e6d112d7e
Although we tried them, the F1 score and TPR-NPR did not work well for our model or our requirements. Being explicit and maximizing recall subject to a minimum precision was the strategy we ultimately chose.
import numpy as np
from sklearn.metrics import precision_recall_curve
X_train['const'] = 1
def find_best_threshold_max_recall(model, X_test, y_test, min_precision=0.5):
    y_pred_probs = model.predict(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_probs)
    # Indices where precision meets the minimum acceptable precision
    acceptable_precision_indices = np.where(precision[:-1] >= min_precision)
    # Select the corresponding recall and threshold values
    acceptable_recall = recall[acceptable_precision_indices]
    acceptable_thresholds = thresholds[acceptable_precision_indices]
    # Index of the maximum recall value
    max_recall_index = np.argmax(acceptable_recall)
    # The best threshold value
    best_threshold = acceptable_thresholds[max_recall_index]
    return best_threshold
min_precision = 0.05 # This value can be adjusted.
best_threshold_max_recall = find_best_threshold_max_recall(best_model, X_test_best, y_test, min_precision)
print("Best threshold value for maximizing recall while maintaining a minimum precision of", min_precision, ":", best_threshold_max_recall)
Best threshold value for maximizing recall while maintaining a minimum precision of 0.05 : 0.44574496519007095
# regular models (no threshold adjustment)
threshold = 0.5
# print("### BEST Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(best_model, X_test_best, y_test, name="BEST Model, No Threshold Adjustment")
# print("### BASE Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(base_logit, X_test['const'], y_test, name="BASE Model, No Threshold Adjustment")
# print("### FULL Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(full_logit, X_test, y_test, name="FULL Model, No Threshold Adjustment")
Accuracy of the model: 0.7617502382240078
Precision of BEST Model, No Threshold Adjustment: 0.07926965041193498
Recall of BEST Model, No Threshold Adjustment: 0.4205552274069699
Accuracy of the model: 0.9563985680805583
Precision of BASE Model, No Threshold Adjustment: 0.0
Recall of BASE Model, No Threshold Adjustment: 0.0
Accuracy of the model: 0.7617502382240078
Precision of FULL Model, No Threshold Adjustment: 0.07926965041193498
Recall of FULL Model, No Threshold Adjustment: 0.4205552274069699
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
<Figure size 3000x1000 with 0 Axes>
<Figure size 3000x1000 with 0 Axes>
<Figure size 3000x1000 with 0 Axes>
# adjusted models (threshold adjustment)
threshold = best_threshold_max_recall
# print("### BEST Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(best_model, X_test_best, y_test, name="BEST Model, With Threshold Adjustment")
# print("### BASE Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(base_logit, X_test['const'], y_test, name="BASE Model, With Threshold Adjustment")
# print("### FULL Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(full_logit, X_test, y_test, name="FULL Model, With Threshold Adjustment")
Accuracy of the model: 0.757887146205156
Precision of BEST Model, With Threshold Adjustment: 0.07907383136740935
Recall of BEST Model, With Threshold Adjustment: 0.42764323685764916
Accuracy of the model: 0.04360143191944166
Precision of BASE Model, With Threshold Adjustment: 0.04360143191944166
Recall of BASE Model, With Threshold Adjustment: 1.0
Accuracy of the model: 0.7576296067372325
Precision of FULL Model, With Threshold Adjustment: 0.07898756273183505
Recall of FULL Model, With Threshold Adjustment: 0.42764323685764916
<Figure size 3000x1000 with 0 Axes>
<Figure size 3000x1000 with 0 Axes>
<Figure size 3000x1000 with 0 Axes>
from sklearn.metrics import precision_recall_curve
ypred = best_model.predict(X_train[best_model_params])
p, r, thresholds = precision_recall_curve(y_train, ypred)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')
plot_precision_recall_vs_threshold(p, r, thresholds)
3.0.2 This is how our model would perform on a test dataset with a 50/50 split of disciplined and not.
Just for reference, assuming a beautiful world where our model might be useful!
# y_test_half and X_test_half are the test set with 50/50 appearance of 0 and 1
y_test_half = y_test.copy()
X_test_half = X_test.copy()
combined_test = pd.concat([X_test, y_test], axis=1)
# zeros and ones are the test set split into two groups, 0 and 1
zeros = combined_test[combined_test[y_test.name] == 0]
ones = combined_test[combined_test[y_test.name] == 1]
min_count = min(len(zeros), len(ones))
sampled_zeros = zeros.sample(min_count)
sampled_ones = ones.sample(min_count)
balanced_test = pd.concat([sampled_zeros, sampled_ones], axis=0)
balanced_test = balanced_test.sample(frac=1).reset_index(drop=True)
X_test_half = balanced_test.drop(columns=[y_test.name])
y_test_half = balanced_test[y_test.name]
# regular models (no threshold adjustment)
threshold = 0.5
# print("### BEST Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(best_model, X_test_half[best_model_params], y_test_half, name="BEST Model, No Threshold Adjustment")
# print("### BASE Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(base_logit, X_test_half['const'], y_test_half, name="BASE Model, No Threshold Adjustment")
# print("### FULL Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(full_logit, X_test_half, y_test_half, name="FULL Model, No Threshold Adjustment")
Accuracy of the model: 0.600708800945068
Precision of BEST Model, No Threshold Adjustment: 0.6574330563250231
Recall of BEST Model, No Threshold Adjustment: 0.4205552274069699
Accuracy of the model: 0.5
Precision of BASE Model, No Threshold Adjustment: 0.0
Recall of BASE Model, No Threshold Adjustment: 0.0
Accuracy of the model: 0.600708800945068
Precision of FULL Model, No Threshold Adjustment: 0.6574330563250231
Recall of FULL Model, No Threshold Adjustment: 0.4205552274069699
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
<Figure size 3000x1000 with 0 Axes>
<Figure size 3000x1000 with 0 Axes>
<Figure size 3000x1000 with 0 Axes>
3.1 Conclusions and Recommendations to stakeholder(s)
At this time, our conclusions for our stakeholders are as follows: 1) more data is needed to develop a useful predictive model, and 2) human judgment is still needed to make the final decision on whether an officer should be disciplined, until GPT can learn to replace us in that too.
For more detailed information, see the final report.